#### GPU Algorithms I The Early Years

MADALGO Summer School on Algorithms for Modern Parallel and Distributed Models

> Suresh Venkatasubramanian University of Utah

#### Motivation



Fast Computation of Generalized Voronoi Diagrams Using Graphics Hardware [SIGGRAPH 1999]

#### Motivation

#### Business Dav The New Hork Times Technology TECHNOLOGY SCIENCE SPORTS WORLD U.S. N.Y. / REGION BUSINESS HEALTH OPINION From PlayStation to Supercomputer for \$50,000 By JOHN MARKOFF May 26, 2003 Published: May 26, 2003 As perhaps the clearest evidence vet of the computing power of

sophisticated but inexpensive video-game consoles, the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign has assembled a supercomputer from an army of Sony PlayStation 2's.

| $\boxtimes$ | E-MAIL        |
|-------------|---------------|
|             | SEND TO PHONE |
| ₽           | PRINT         |
|             |               |

The resulting system, with components purchased at retail prices, cost a little more than \$50,000. The center's researchers believe the system may be capable of a half trillion operations a second, well within the definition of supercomputer, although it may not rank among the world's 500 fastest supercomputers.

Perhaps the most striking aspect of the project, which uses the open source Linux operating system, is that the only hardware engineering involved was placing 70 of the individual game machines in a rack and plugging them together with a high-speed Hewlett-Packard network switch. The center's scientists bought 100 machines, but are holding 30 in reserve, possibly for high-resolution display application.

#### Animusic Demo

#### Overview

- GPUs today have 10s of cores (soon, 100s !)
- Have huge data bandwidth (100s of GB/s)
- Force a SIMD-style mode of computation
- HPC on the cheap !

However,

- GPU models constantly changing
- Almost no algorithmic work on GPUs
- Disconnect between programming model and execution model

#### Overview

- GPUs today have 10s of cores (soon, 100s !)
- Have huge data bandwidth (100s of GB/s)
- Force a SIMD-style mode of computation
- HPC on the cheap !

However,

- GPU models constantly changing
- Almost no algorithmic work on GPUs
- Disconnect between programming model and execution model

# No proofs!













A streaming model









#### Coined by John Backus



#### Coined by John Backus

Has always been a problem (regardless of technology)



#### Coined by John Backus

Has always been a problem (regardless of technology)

GPU design is an attempt to get around it



• Systolic arrays move data between a grid of processing elements.



- Systolic arrays move data between a grid of processing elements.
- Vector processors operate on multiple *words* at a time

- Systolic arrays move data between a grid of processing elements.
- Vector processors operate on multiple *words* at a time
- SIMD processors have a fixed program that runs on different streams



- Systolic arrays move data between a grid of processing elements.
- Vector processors operate on multiple *words* at a time
- SIMD processors have a fixed program that runs on different streams



- Systolic arrays move data between a grid of processing elements.
- Vector processors operate on multiple *words* at a time
- SIMD processors have a fixed program that runs on different streams



Flynn's taxonomy

- Systolic arrays move data between a grid of processing elements.
- Vector processors operate on multiple *words* at a time
- SIMD processors have a fixed program that runs on different streams



Flynn's taxonomy

- Systolic arrays move data between a grid of processing elements.
- Vector processors operate on multiple *words* at a time
- SIMD processors have a fixed program that runs on different streams



Flynn's taxonomy

- Systolic arrays move data between a grid of processing elements.
- Vector processors operate on multiple *words* at a time
- SIMD processors have a fixed program that runs on different streams

#### **GPU** Basics

• •









Vertex pipeline







![](_page_34_Figure_1.jpeg)

![](_page_35_Figure_1.jpeg)
### The GPU Pipeline



• Every pixel acts like an SIMD processor

- *Every pixel* acts like an SIMD processor
  - actually, some fixed number are processed in parallel

- Every pixel acts like an SIMD processor
  - actually, some fixed number are processed in parallel
- Fragment processor could perform simple *straight-line operations* and conditionals (no looping)

- Every pixel acts like an SIMD processor
  - actually, some fixed number are processed in parallel
- Fragment processor could perform simple *straight-line operations* and conditionals (no looping)
- (limited) texture memory for local storage

- Every pixel acts like an SIMD processor
  - actually, some fixed number are processed in parallel
- Fragment processor could perform simple *straight-line operations* and conditionals (no looping)
- (limited) texture memory for local storage
- Each pixel processor could do a simple reduce (add, blend)

- Every pixel acts like an SIMD processor
  - actually, some fixed number are processed in parallel
- Fragment processor could perform simple *straight-line operations* and conditionals (no looping)
- (limited) texture memory for local storage
- Each pixel processor could do a simple reduce (add, blend)
- Computation initiated by "rendering call" from host machine.

- Every pixel acts like an SIMD processor
  - actually, some fixed number are processed in parallel
- Fragment processor could perform simple *straight-line operations* and conditionals (no looping)
- (limited) texture memory for local storage
- Each pixel processor could do a simple reduce (add, blend)
- Computation initiated by "rendering call" from host machine.
- All computation resides on GPU from start of the vertex pipeline

- Every pixel acts like an SIMD processor
  - actually, some fixed number are processed in parallel
- Fragment processor could perform simple *straight-line operations* and conditionals (no looping)
- (limited) texture memory for local storage
- Each pixel processor could do a simple reduce (add, blend)
- Computation initiated by "rendering call" from host machine.
- All computation resides on GPU from start of the vertex pipeline
- Computation proceeds in *passes*: output could be rendered or stored in memory for next pass.

Simple GPU Algorithms

• •

•



















Voronoi diagram is *lower envelope* of collection of distance functions

• For each point, render a *cone* of colored triangles

- For each point, render a *cone* of colored triangles
  - Use many triangles to approximate smooth cone

- For each point, render a *cone* of colored triangles
  - Use many triangles to approximate smooth cone
  - Use shading to encode distance as color value

- For each point, render a *cone* of colored triangles
  - Use many triangles to approximate smooth cone
  - Use shading to encode distance as color value
- Fragment processor at (x, y) receives stream of pixels  $p_1, p_2, \ldots$

- For each point, render a *cone* of colored triangles
  - Use many triangles to approximate smooth cone
  - Use shading to encode distance as color value
- Fragment processor at (*x*, *y*) receives stream of pixels *p*<sub>1</sub>, *p*<sub>2</sub>, . . .

 $\text{min} \gets 0$ 

- For each point, render a *cone* of colored triangles
  - Use many triangles to approximate smooth cone
  - Use shading to encode distance as color value
- Fragment processor at (*x*, *y*) receives stream of pixels *p*<sub>1</sub>, *p*<sub>2</sub>, . . .

 $\min \leftarrow 0$ **if** depth( $p_i$ ) < min **then** 

- For each point, render a *cone* of colored triangles
  - Use many triangles to approximate smooth cone
  - Use shading to encode distance as color value
- Fragment processor at (x, y) receives stream of pixels  $p_1, p_2, \ldots$

```
\min \leftarrow 0
if depth(p_i) < min then
{GPU Z-test}
```

- For each point, render a *cone* of colored triangles
  - Use many triangles to approximate smooth cone
  - Use shading to encode distance as color value
- Fragment processor at (x, y) receives stream of pixels  $p_1, p_2, \ldots$

```
 \begin{array}{l} \min \leftarrow 0 \\ \text{if depth}(p_i) < \min \text{ then} \\ \{ \text{GPU Z-test} \} \\ \min = \text{depth}(p_i) \ \{ \text{GPU blending operation} \} \end{array}
```

- For each point, render a *cone* of colored triangles
  - Use many triangles to approximate smooth cone
  - Use shading to encode distance as color value
- Fragment processor at (*x*, *y*) receives stream of pixels *p*<sub>1</sub>, *p*<sub>2</sub>, . . .

```
\min \leftarrow 0
if depth(p_i) < min then
{GPU Z-test}
min = depth(p_i) {GPU blending operation}
color(x, y) = color(p_i)
```

- For each point, render a *cone* of colored triangles
  - Use many triangles to approximate smooth cone
  - Use shading to encode distance as color value
- Fragment processor at (*x*, *y*) receives stream of pixels *p*<sub>1</sub>, *p*<sub>2</sub>, . . .

```
\begin{array}{l} \min \leftarrow 0 \\ \text{if } \operatorname{depth}(p_i) < \min \text{ then} \\ \{ \operatorname{GPU} Z \text{-test} \} \\ \min = \operatorname{depth}(p_i) \{ \operatorname{GPU} \text{ blending operation} \} \\ \operatorname{color}(x, y) = \operatorname{color}(p_i) \\ \text{end if} \end{array}
```

- For each point, render a *cone* of colored triangles
  - Use many triangles to approximate smooth cone
  - Use shading to encode distance as color value
- Fragment processor at (*x*, *y*) receives stream of pixels *p*<sub>1</sub>, *p*<sub>2</sub>, . . .

```
\begin{array}{l} \min \leftarrow 0 \\ \text{if } \operatorname{depth}(p_i) < \min \text{ then} \\ \{ \operatorname{GPU} \operatorname{Z-test} \} \\ \min = \operatorname{depth}(p_i) \{ \operatorname{GPU} \text{ blending operation} \} \\ \operatorname{color}(x, y) = \operatorname{color}(p_i) \\ \text{end if} \end{array}
```

• Rendering engine is the *mapper* 

- For each point, render a *cone* of colored triangles
  - Use many triangles to approximate smooth cone
  - Use shading to encode distance as color value
- Fragment processor at (*x*, *y*) receives stream of pixels *p*<sub>1</sub>, *p*<sub>2</sub>, . . .

```
\begin{array}{l} \min \leftarrow 0 \\ \text{if depth}(p_i) < \min \text{ then} \\ \{\text{GPU Z-test}\} \\ \min = \text{depth}(p_i) \{\text{GPU blending operation}\} \\ \operatorname{color}(x, y) = \operatorname{color}(p_i) \\ \text{end if} \end{array}
```

- Rendering engine is the *mapper*
- *Gathering* happens automatically, with fixed key (x, y)

- For each point, render a *cone* of colored triangles
  - Use many triangles to approximate smooth cone
  - Use shading to encode distance as color value
- Fragment processor at (*x*, *y*) receives stream of pixels *p*<sub>1</sub>, *p*<sub>2</sub>, . . .

```
\begin{array}{l} \min \leftarrow 0 \\ \text{if depth}(p_i) < \min \text{ then} \\ \{\text{GPU Z-test}\} \\ \min = \text{depth}(p_i) \{\text{GPU blending operation}\} \\ \operatorname{color}(x, y) = \operatorname{color}(p_i) \\ \text{end if} \end{array}
```

- Rendering engine is the *mapper*
- *Gathering* happens automatically, with fixed key (x, y)
- Fragment processors implement reduce

### Simple GPU Matrix Multiplication

$$C = A \cdot B$$
$$C_{ij} = \sum_{k} A_{ik} B_{kj}$$

### Simple GPU Matrix Multiplication



#### Simple GPU Matrix Multiplication


#### Simple GPU Matrix Multiplication



• GPU loops have to be unrolled

#### Simple GPU Matrix Multiplication



- GPU loops have to be unrolled
- This works only if *k* is small







• Multiple passes increase number of memory access, but help with caching



- Multiple passes increase number of memory access, but help with caching
- Each "field" is actually four values



- Multiple passes increase number of memory access, but help with caching
- Each "field" is actually four values
- Can get a factor 4 speedup with careful partitioning of matrix



- Multiple passes increase number of memory access, but help with caching
- Each "field" is actually four values
- Can get a factor 4 speedup with careful partitioning of matrix
- More complicated methods needed for sparse multiplication

• Compute parallelism is not enough

- Compute parallelism is not enough
- Need SIMD structures (find repeated instruction patterns)

- Compute parallelism is not enough
- Need SIMD structures (find repeated instruction patterns)
- External memory methods manage I/O bottlenecks but don't exploit compute power.

- Compute parallelism is not enough
- Need SIMD structures (find repeated instruction patterns)
- External memory methods manage I/O bottlenecks but don't exploit compute power.
- Plan: Use *sorting networks:*

- Compute parallelism is not enough
- Need SIMD structures (find repeated instruction patterns)
- External memory methods manage I/O bottlenecks but don't exploit compute power.
- Plan: Use *sorting networks:* 
  - Many simple (and local) compute elements

- Compute parallelism is not enough
- Need SIMD structures (find repeated instruction patterns)
- External memory methods manage I/O bottlenecks but don't exploit compute power.
- Plan: Use *sorting networks:* 
  - Many simple (and local) compute elements
  - High-throughput and synchronous





















Bitonic sort requires  $\log^2 n$  layers, n/2 comparators/layer







• Fill 2D array with values



- Fill 2D array with values
- (for each pass) construct quadrilateral with lookup values



- Fill 2D array with values
- (for each pass) construct quadrilateral with lookup values
- Texture hardware locates lookup values, and fragment program does comparisons



- Fill 2D array with values
- (for each pass) construct quadrilateral with lookup values
- Texture hardware locates lookup values, and fragment program does comparisons
- $\log^2 n$  passes used to complete the computation

#### Review

- Brief history of GPU model
- Simple GPU SIMD model
- Examples: Voronoi diagrams, matrix multiplication and sorting

- More simple GPU examples
- Toy example of algorithmic view: "GPU as streaming processor"
- The CUDA model for modern GPUs
- "Hello world" example: matrix multiplication in CUDA

# Questions?